This report explores a dataset containing quality ratings and chemical attributes for 1,599 red wine samples, with 11 variables describing the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## [1] 1599 13
str(red_wine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our dataset consists of 13 variables (the row index X, 11 chemical properties, and quality) and 1,599 observations.
Most samples have a quality of 5 or 6. No sample has a quality value below 3 or above 8.
The graph above shows histograms for all variables in the dataset, so we can see how the data is distributed for each variable. For better visualizations, let’s view each graph individually, which lets us customize the plot for each variable.
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
It’s a right-skewed distribution with a peak at about 7, a mean of 8.32, and a maximum value of 15.90.
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
It’s a bimodal distribution with two peaks, at about 0.4 and 0.6. I suppose that high levels of volatile acidity will lead to worse wine quality.
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
It’s right-skewed with a mean of 0.271 and a maximum of 1. I think high-quality wines should contain a certain amount of citric acid.
residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
It’s a roughly bell-shaped distribution with a long right tail. The second graph shows the log transformation of residual sugar. The mean value is 2.54 and the maximum goes all the way up to 15.50. No values come close to 45, but a few values are less than 1.
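To illustrate why a log transformation helps with a long right tail, here is a minimal sketch that applies `log10` to the five-number summary printed above. The variable names are my own, and the inputs are only the summary quantiles, not the full residual.sugar column.

```r
# Quantiles of residual.sugar copied from the summary above.
sugar_summary <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Distance from the median to each extreme on the raw scale.
raw_right <- sugar_summary[["max"]] - sugar_summary[["median"]]
raw_left  <- sugar_summary[["median"]] - sugar_summary[["min"]]

# The same distances after a log10 transformation.
log_summary <- log10(sugar_summary)
log_right <- log_summary[["max"]] - log_summary[["median"]]
log_left  <- log_summary[["median"]] - log_summary[["min"]]

# Ratio of right-tail length to left-tail length, before and after.
raw_ratio <- raw_right / raw_left
log_ratio <- log_right / log_left

print(round(c(raw = raw_ratio, log10 = log_ratio), 2))
```

On the raw scale the right tail is roughly ten times longer than the left one; after the transformation the ratio drops to roughly two, which is why the log-scaled histogram looks much more symmetric.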
chlorides: the amount of salt in the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
It’s a long-tailed histogram with a mean of 0.087 and a median of 0.079.
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] 11 25 15 17 11 13 15 15 9 17 15 17 16 9 52 51 35 16 6 17 29 23 10
## [24] 9 21 11 4 10 14 8 17 22 15 40 13 5 3 13 7 12 12 17 8 9 5 8
## [47] 22 12 5 12 4 8 6 30 33 25 4 50 17 9 19 20 12 13 4 4 11 6 27
## [70] 8 15 17 18 11 28 9 9 14 12 27 3 22 21 16 18 19 20 9 34 8 42 20
## [93] 19 9 41 17 8 3 5 13
It’s a right-skewed distribution with a mean of 15.87 and a median of 14.00. Most of the values are integers.
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
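The integer observation can be checked directly. The sketch below uses only the first 23 values from the printed output above, so it illustrates the check rather than reproducing the result for the full column; the variable names are my own.

```r
# First 23 free.sulfur.dioxide values copied from the printed output above.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, 51,
              35, 16, 6, 17, 29, 23, 10)

# A value is whole if it equals its rounded self.
is_whole <- free_so2 == round(free_so2)

# Share of whole-number values in this excerpt.
share_whole <- mean(is_whole)
print(share_whole)
```

Applied to the full column, the same expression would give the exact share of integer-valued measurements.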
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
It’s a right-skewed distribution. Most of the values are integers, measured in mg/dm^3.
density: the density of wine is close to that of water depending on the percent alcohol and sugar content.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density is approximately normally distributed, with a mean of 0.9967 and a median of 0.9968.
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
It’s approximately normally distributed, with a mean of 3.311 and a median of 3.310. Most values lie between 3 and 3.7.
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
It’s a right-skewed distribution with a long tail. Its mean is 0.6581 and its median is 0.62.
alcohol: the percent alcohol content of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
It’s a right-skewed distribution with a mean of 10.42 and a median of 10.20. Wine is an alcoholic drink, so I wonder how alcohol is related to wine quality.
There are 1,599 red wine samples in the dataset with 13 features. All of them are floating-point numbers except quality and X, which are integers.
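The column types can be confirmed programmatically. This is a minimal sketch on a three-row excerpt of four columns copied from the `str()` output above; the data frame name `wine_head` is hypothetical and stands in for the full `red_wine` data frame.

```r
# First three rows of a few columns, copied from the str() output above.
wine_head <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Storage class of each column: "integer" or "numeric".
col_classes <- sapply(wine_head, class)
print(col_classes)

# Names of the integer columns.
int_cols <- names(col_classes)[col_classes == "integer"]
print(int_cols)
```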
The main features in the data set are volatile acidity, alcohol and quality. I’d like to determine which features are best for predicting the quality of a wine sample. I suspect volatile acidity, alcohol and some combination of the other variables can be used to build a predictive model of quality.
pH and citric acid likely contribute to the quality of wine.

### Did you create any new variables from existing variables in the dataset?

No, I don’t think there is any need to create new variables.
No, there was no need to perform any operations on this data set because it is already tidy: it appears to have been wrangled and cleaned. There are some unusual outliers, but they seem to be real values.
In the above graph, we see that quality is negatively correlated with volatile acidity (about -0.4), while it is positively correlated with alcohol and sulphates (about 0.5 and 0.3, respectively).
Generally, quality tends to increase as volatile acidity decreases, reflecting the negative correlation between them. That agrees with our expectations, because high levels of volatile acidity can lead to an unpleasant taste.
In general, wines with more alcohol tend to have higher quality values, except at a quality value of 5.
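A per-quality summary makes this kind of comparison concrete. The sketch below computes the median alcohol content for each quality score using only the first ten rows shown in the `str()` output, so the numbers illustrate the technique and are not the medians of the full dataset.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol content within each quality score.
median_by_quality <- tapply(alcohol, quality, median)
print(median_by_quality)
```

With the full data frame, the equivalent call would be `tapply(red_wine$alcohol, red_wine$quality, median)`.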
##
## Pearson's product-moment correlation
##
## data: red_wine$sulphates and red_wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Higher quality wines tend to have more sulphates.
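As a worked check of the test output above, the t statistic and the 95% confidence interval can be recomputed from the reported correlation and degrees of freedom alone. This sketch assumes the standard Pearson test (t with n - 2 degrees of freedom) and a Fisher z-transformation for the interval; the variable names are my own.

```r
# Values reported by cor.test() above for sulphates vs. quality.
r  <- 0.2513971
df <- 1597
n  <- df + 2   # Pearson's test uses df = n - 2

# t statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(0.975)
ci   <- tanh(z + c(-1, 1) * crit * se)

print(round(t_stat, 2))
print(round(ci, 4))
```

The recomputed values agree with the printed t = 10.38 and the interval from 0.2049 to 0.2968, which confirms that the correlation is weak but clearly different from zero.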
##
## Pearson's product-moment correlation
##
## data: red_wine$citric.acid and red_wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The above graph shows that median citric acid increases as quality increases. The correlation between citric acid and quality is 0.226; although this is a weak correlation, citric acid does appear to be related to wine quality.
##
## Pearson's product-moment correlation
##
## data: red_wine$citric.acid and red_wine$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Citric acid is fairly strongly negatively correlated with volatile acidity, with a correlation of -0.5525.
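Test outputs like the one above come from `cor.test()`. Here is a minimal sketch of the call on the first ten rows shown in the `str()` output; with so few rows it will not reproduce the reported -0.55, it only shows how the estimate and interval are extracted.

```r
# First ten rows copied from the str() output above.
citric_acid      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Pearson's product-moment correlation test (the default method).
result <- cor.test(citric_acid, volatile_acidity)

r_hat <- unname(result$estimate)  # sample correlation
ci    <- result$conf.int          # 95% confidence interval by default
print(r_hat)
print(ci)
```

On the full data, the call would be `cor.test(red_wine$citric.acid, red_wine$volatile.acidity)`.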
##
## Pearson's product-moment correlation
##
## data: red_wine$density and red_wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Alcohol is negatively correlated with density, with a correlation of about -0.50.
##
## Pearson's product-moment correlation
##
## data: red_wine$chlorides and red_wine$sulphates
## t = 15.978, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3282127 0.4127694
## sample estimates:
## cor
## 0.3712605
There is no strong relationship between sulphates and chlorides, although they are correlated with a value of 0.37. We can also see that the number of outliers increases as sulphates increase.
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
Fixed acidity and density are strongly correlated with a value of 0.67.
Yes, there is a strong relationship between free sulfur dioxide and total sulfur dioxide, which are not part of my main analysis. A high correlation between density and fixed acidity was also observed.
The strongest relationship is between pH and fixed acidity, with a correlation value of -0.68.
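One way to compare the strength of these relationships is to square each correlation: for a pair of variables, r^2 is the share of variance in one that a straight-line fit on the other would account for. The sketch below uses the coefficients reported in this section (the pH value is the rounded figure from the text); the vector and its element names are my own.

```r
# Correlation coefficients reported in this section.
r <- c(
  sulphates_quality     = 0.2513971,
  citric_quality        = 0.2263725,
  citric_volatile       = -0.5524957,
  density_alcohol       = -0.4961798,
  chlorides_sulphates   = 0.3712605,
  fixed_acidity_density = 0.6680473,
  ph_fixed_acidity      = -0.68
)

# Squared correlation: share of variance explained by a simple linear fit.
r_squared <- r^2
print(round(sort(r_squared, decreasing = TRUE), 3))
```

Seen this way, even the strongest pair accounts for less than half of the variance, and the correlations with quality account for well under a tenth.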
From the above graphs, we can see that most wines with quality values greater than 6 have citric acid values greater than 0.25 and alcohol values greater than 11%. I also split the data with facet_wrap to check whether points were overplotting.
Most of the lowest-quality wines have higher volatile acidity: most wines with quality values lower than 5 have volatile acidity greater than 0.5, while their citric acid values vary along the x axis.
Most wines with high quality values have alcohol values greater than 11%. These wines also tend to have lower density, because alcohol and density are negatively correlated.
As expected, high-quality wines tend to have more alcohol. They also tend to have higher citric.acid and lower volatile.acidity.
No, I did not see any surprising interactions in my analysis.
This plot is from the univariate plots section. It is an important graph because it shows how our samples are distributed across quality values. Most of our samples have a quality value of 5 or 6; 13.5% have a quality value greater than 6, and 3.9% have a quality value less than 5.
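Shares like these can be computed from the quality column with `table()` and `mean()`. The sketch below uses only the first ten quality scores from the `str()` output, so its percentages differ from the full-dataset figures quoted above; the variable names are my own.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Share of samples at each quality score.
share_by_score <- prop.table(table(quality))
print(share_by_score)

# Shares above 6 and below 5, as percentages.
pct_above_6 <- 100 * mean(quality > 6)
pct_below_5 <- 100 * mean(quality < 5)
print(c(above_6 = pct_above_6, below_5 = pct_below_5))
```

Replacing `quality` with `red_wine$quality` would give the percentages for all samples.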
This plot is from the bivariate plots section. Wine is an alcoholic drink, so alcohol is expected to have an important effect on quality. The graph shows that, in general, wines with more alcohol tend to have higher quality.
This plot is from the multivariate plots section.
Most wines with quality values greater than 6 have citric acid values greater than 0.25 and alcohol values greater than 11%.
The red wine data set contains information on 1,599 wine samples, with 11 variables on the chemical properties of the wine. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables.
I plotted a correlation matrix between all the variables. It was the most important graph for my analysis, because it helped me restrict the analysis to the variables that correlate most strongly with each other.
Our dataset has few samples with quality values of 3, 4, 7, or 8, so I think having more samples in general would improve the analysis.
In future work we can do :